ABSTRACT 

This report evaluates the conduct, validity, and uses 
of the National Assessment of Educational Progress (NAEP) Trial State 
Assessment (TSA). The report addresses such pressing problems as how 
participation in NAEP can be maintained and appropriate samples can 
be achieved; how errors can be minimized in the complex process of 
scaling and analyzing data; how the definition of achievement levels 
can be accomplished; how children with limited English 
proficiency or disabilities can be included and reported; how private 
schools can be included and reported; and how the NAEP state 
assessments relate to the national NAEP. After an introduction, 
sections of the report are The Content Validity of the 1994 Reading 
Assessment; Sampling and Assessment Administration for the 1994 TSA; 
The Assessment of Students with Disabilities or Limited English 
Speaking Proficiency; Scaling and Analysis of the 1994 Reading 
Assessment; Reading Achievement Levels; Reporting and Dissemination 
for the 1994 Reading Assessment; and Conclusions and Recommendations. 
Contains 66 references, and 21 tables and 7 figures of data. 
Appendixes present detailed scoring guides and examples of student 
responses for sample assessment items shown in figure 2.1; reading 
experts participating in the panel's content validity study for the 
1994 TSA; and synopses of studies for the National Academy of 
Education Panel on the Evaluation of the National Assessment of 
Educational Progress Trial State Assessment. (RS) 
Panel Chairmen 

Robert Glaser, University of Pittsburgh 
The National Academy of Education 



The National Academy of Education is composed of scholars and education leaders 
who “promote scholarly inquiry and discussion concerning the ends and means of 
education, in all its forms, in the United States and abroad.” Our current active 
membership is limited to 125 scholars. The heart of the Academy is found in the 
lively discussions that take place in our regular meetings and in the special panels and 
committees that we establish. Throughout our 31-year history the Academy has been 
called upon by governmental and other agencies to conduct special studies and 
reviews on education issues of public interest, ranging from desegregation to the 
teaching of reading to standards-based reform. During the past six years, a panel of 
the Academy has been monitoring, studying, and making recommendations 
concerning the conduct of the Trial State Assessments given since 1990 in conjunction 
with the National Assessment of Educational Progress. For the first time in the history 
of NAEP, these assessments allow state-by-state comparisons of education 
achievement; they have proven to be of great interest to educators. The Academy’s 
Panel has prepared three in-depth biennial reports on the trial state assessments of 
1990, 1992, and, in this report, 1994. The Panel also issued, at the request of the National 
Center for Education Statistics, which oversees the Panel’s work, a special report on 
the setting of achievement levels in connection with NAEP. 

To carry out the Panel’s challenging assignment, the Academy engaged two 
outstanding education researchers, Robert Glaser and Robert Linn, as co-chairs, and 
George Bohrnstedt and his colleagues at the American Institutes for Research as 
subcontractors to assist in the conduct of the research and writing, as well as a panel 
of distinguished educators and researchers with diverse forms of expertise on the 
NAEP program. They have constructed and overseen an ongoing set of research and 
policy papers examining major issues concerning the state trials of NAEP, including 
validity, sampling, content, data analysis, and reporting issues. In the coming year, the 
Panel will conclude its work with a capstone report that will reflect on broad, long- 
term issues regarding the future of state NAEP and its relationship to national NAEP. 



Carl F. Kaestle 

President, The National Academy of Education 
Jeanne Griffith 

Acting Commissioner 

National Center for Education Statistics 

U.S. Department of Education 

555 New Jersey Avenue 20208-5653 



Dear Jeanne: 

On behalf of The National Academy of Education, I am pleased to transmit to you the 
fourth report of the Academy’s Panel on the Evaluation of the NAEP Trial State 
Assessment, entitled Quality and Utility: The 1994 Trial State Assessment in Reading. 
In it, the Panel evaluates the conduct, validity, and uses of that assessment. They 
apply their guiding principles and their research findings to many policy issues 
concerning the future of the NAEP state assessments. 

This report has been reviewed and approved by The National Academy of Education’s 
Executive Council, acting as a Committee of Readers. We are confident that, like the 
preceding reports of this panel, it will be helpful to policymakers in reaching 
thoughtful decisions about important policy issues concerning the National 
Assessment of Educational Progress. 

This report addresses such pressing problems as how participation in NAEP can be 
maintained and appropriate samples be achieved; how errors can be minimized in the 
complex process of scaling and analyzing the data; how the definition of achievement 
levels can be accomplished; how children with limited English proficiency 
or disabilities can be included and reported; how private schools can be included and 
reported; and how the NAEP state assessments relate to the national NAEP. 

In addition to these and other consequential issues surrounding the ongoing 
administration and interpretation of state NAEP, there are a number of longer-term 
issues associated with the future of state and national NAEP and the assessment’s 
relationship to developments in American education. To address these issues, the 
Panel plans shortly to issue a capstone report. Issues surrounding the redesign of 
NAEP stem from fundamental changes occurring in the field of educational assessment 
and from the ongoing education reform movement, aimed at transforming the content 
and structure of American education. To determine NAEP’s most constructive role in 
the midst of these changes is both very difficult and very important. The Academy is 
proud of the contributions the Panel has made to that effort over the past six years, 
and we look forward to its capstone report as the culminating advice of this 
distinguished group of educators and scholars. 



Sincerely, 




Carl F. Kaestle 

President, The National Academy of Education 
Professor of Education, The University of Chicago 
Foreword 






Since 1990, every cycle of the National Assessment of Educational Progress (NAEP) 
has included an option for states to participate on a voluntary basis and receive 
state-level results in at least one subject area at one grade level. State NAEP 
assessments were first authorized by Congress in 1988, at which time Congress 
mandated that an evaluation of the feasibility and technical adequacy of such 
assessments be conducted for trials in 1990 and 1992. Pursuant to this legislation, 
Trial State Assessments (TSAs) were conducted in eighth-grade mathematics in 1990 
and in fourth- and eighth-grade mathematics and fourth-grade reading in 1992, and 
Congress subsequently extended the trials and the evaluation to include the 1994 
assessment as well. This report has been prepared in response to that mandate and 
provides an evaluation of the 1994 Trial State Assessment (TSA) in fourth-grade 
reading by The National Academy of Education’s Panel on the Evaluation of the Trial 
State Assessment. 

The Panel’s work on the 1994 TSA fourth-grade reading assessment has taken place in 
a period during which numerous innovations in assessment have been implemented 
at the national, state, and local levels. In this context, NAEP serves as a valuable 
independent monitor of status and trends for student achievement as our nation 
proceeds toward improved education for all children and youth. However, NAEP too 
has changed and, in order to be effective, must continue to change and adapt to the 
many requirements posed by new content, new techniques for measuring 
performance, and more inclusive coverage of the nation’s diversity. The Panel 
believes that systematic study of such innovations and their results should continue to 
be an essential part of efforts to enhance our nation’s key independent indicator of 
educational progress. 

This is the fourth of the Panel’s reports. Encompassing numerous studies and 
analytical papers commissioned by the Panel, these reports have served to inform 
technical and policy issues under consideration by Congress, the National Center for 
Education Statistics (NCES), the National Assessment Governing Board (NAGB), and 
the NAEP contractors. 

The first report, Assessing Student Achievement in the States, was issued in March, 
1992. It presented the Panel’s findings and observations on the first TSA, which was 
conducted in eighth-grade mathematics in 1990. More specifically, the report 
presented the Panel’s observations on the assessment’s content validity, sampling, 
administration, scoring and interpretation, as well as on the reporting of results to the 
public and press. While the Panel concluded that the trial was largely a success, a set 
of recommendations for changes was also included in the report. 

The second report, Setting Performance Standards for Student Achievement, issued in 
September, 1993, studied the new set of performance standards, called achievement 
levels, that were being implemented for reporting and interpreting NAEP results. The 
Panel’s report examined the process used for setting achievement levels, the validity 
and reasonableness of the 1992 achievement levels in reading and mathematics, and 
the relationship of NAEP to emerging national education standards. The Panel’s 
report, we believe, has made a valuable contribution to the continuing discussion and 
debate about how best to set performance standards on an assessment such as NAEP. 



The Trial State Assessment: Prospects and Realities, the Panel’s third report, was issued 
in December, 1993 and examined the 1992 state assessments in reading and 
mathematics as well as critical questions surrounding the continuance of a state NAEP. 
In addition to issues of sampling and administration, content validity, and reporting, 
this report presented a set of guiding principles that could inform, not only the 
recommendations made in the report, but also discussions and decisions concerning 
the TSA made by Congress, NCES, and NAGB. 

The Panel’s forthcoming capstone report will be released in fall, 1996 and will address 
the role of NAEP in education reform and choices that confront NAEP now and as we 
approach the 21st century. Among the latter are choices about how NAEP can best 
incorporate modern understandings of the acquisition and organization of knowledge, 
exploit new technologies, accommodate individuals with special needs, and link 
with other assessment and other educationally relevant data sets to provide richer 
information on the progress of American education. 



Robert Glaser, Chair 
Robert Linn, Co-Chair 



George Bohrnstedt, Project Director 
April 1996 
Executive Summary 



Quality and Utility: The 1994 Trial State Assessment in Reading 

The National Assessment of Educational Progress (NAEP) has been the nation’s 
leading indicator of academic achievement for more than 25 years, providing fair and 
accurate information about the performance of U.S. students in core subject areas. 
With findings based on representative samples of students in grades 4, 8, and 12, 
NAEP has long been recognized as an unparalleled resource for educators, policy 
makers, and all others concerned with national trends in educational progress. 

Analyses of NAEP trend data in the past decades revealed a significant narrowing of 
the achievement gap between African American and white students during the late 
1970s and 1980s, while subsequent changes in these same data patterns alerted 
educators and policy makers to an apparent reopening of that gap in the 1990s. 
Similarly, NAEP has pointed to important trends in specific subject areas, 
documenting, for example, the declining achievement of U.S. students in science 
between 1970 and 1990, as well as the limited time spent on science instruction in 
most elementary classrooms. These latter data have provided evidential support for 
groups such as the American Association for the Advancement of Science and the 
National Science Foundation, who have argued for greater attention to science 
education in American schools. Finally, NAEP data were also cited in discussions and 
debates that, in 1989, led the governors and then, in 1994, Congress to target 
improved science achievement as an important national education goal. 

In 1988, NAEP’s role was substantially expanded. Responding to increased education 
reform activity and heightened interest in monitoring progress within the states, 
Congress lifted the prohibition against collecting and reporting NAEP data at the state 
level. Public Law 100-297 authorized voluntary state NAEP assessments on a trial basis 
for 1990 and 1992; this authorization was subsequently extended to include a third 
trial state assessment (TSA) in 1994. In recognition of the significance of the state 
NAEP experiment, Congress also called for an independent evaluation of the TSA to 
judge its feasibility, quality, and utility. Under a grant from the National Center for 
Education Statistics (NCES), The National Academy of Education (NAE) established this 
Panel to undertake the evaluation. 

Three previous Panel reports, spanning the first two TSAs, have been submitted to 
Congress. In them, the Panel concluded that the TSAs were successful and should be 
continued. Areas for further study were also identified, including areas in which the 
full consequences would not be evident before the trials were scheduled to end. 
Accordingly the Panel, in its most recent report (on the 1992 TSA), called for a 
continuing evaluation and — having noted that “many of the factors affecting the 
quality and feasibility of state NAEP are the same as those affecting national NAEP” 1 — 
proposed that the evaluation be expanded to include the full NAEP program. It also 
recommended continuing research and development in the important area of 
performance standards. 



1 The National Academy of Education, The Trial State Assessment: Prospects and Realities (Stanford, CA: 
Author, 1993), 104. 

With the Improving America’s Schools Act of 1994, Congress adopted many of the 
Panel’s recommendations. Importantly, the legislation authorized NAEP state 
assessments, mandated the continuing independent review of the entire NAEP 
program, and directed that the state assessments and achievement levels be used on a 
developmental basis until the Commissioner of Education Statistics made a final 
determination of their validity and utility. 

In this, its fourth report, the Panel presents recommendations and findings specific 
to its evaluation of the 1994 TSA in fourth-grade reading and offers several general 
conclusions regarding the state NAEP assessments. This fall, the Panel will 
conclude its work by releasing a capstone report that builds on these conclusions, 
and on its previous reports, to address issues in the redesign of NAEP for the year 
2000 and beyond. 



Dimensions of the Evaluation 



In preparing this report, the Panel has drawn upon its extensive experience with the 
previous TSAs as well as studies and papers commissioned specifically for the 1994 
assessment. In particular, the Panel found that the guiding principles articulated in its 
third report to Congress remain highly relevant and continue to shape the perspective 
for its evaluation. These principles, revised and regrouped for clarity, are presented 
on pages 3 through 5 of this report. 

The Panel gathered evidence regarding several dimensions of the 1994 assessment and 
weighed them against its principles. Some of these dimensions — content validity, 
sampling, assessment administration, and reporting and dissemination — have been 
central to the Panel’s considerations of each of the previous TSAs. For 1994, the Panel 
also updated its conclusions regarding the National Assessment Governing Board 
(NAGB) achievement levels and added new emphases on scaling and analysis and on 
the assessment of students with disabilities or limited English proficiency. 

These various dimensions are discussed in chapters 2 through 7 of the report; the 
Panel’s primary conclusions and recommendations on each are presented below. 



Content Validity 

The 1994 NAEP reading assessment marked the second use of the reading framework 
developed in 1991. A portion of the item pool was released to the public and 
replaced between the 1992 and 1994 assessments, but the overall parameters of the 
assessment were held constant, allowing reading trends to be measured for the first 
time on tasks that reflect current understandings of reading and reading assessment. 
The Panel reviewed the framework and items for content validity after each of the two 
assessments and, in each instance, concluded that the NAEP reading assessment was a 
reasonable representation of current theories in reading, a reasonably valid measure of 
reading achievement in the nation, and relevant to everyday classroom practice. The 
Panel commends NAGB and NCES for building a challenging assessment of reading 
achievement that extends beyond simple mastery of the mechanics of reading to 
include the reader’s ability to draw meaning from text and to communicate this 
understanding to others. 

Furthermore, the Panel concluded that the decision to hold frameworks in reading and 
other content areas constant over several assessment cycles was praiseworthy — a 
judgment that was confirmed by the strong interest of NAEP constituents in using 1994 
results to gauge the progress of their students over time. Based on these findings, the 
Panel recommends that the general structure (framework) of the present reading 
assessment be maintained through the year 2000 or 2002. 

During the evaluation, the Panel’s reading experts also noted aspects of the 
framework and item pool that could be improved, although none of these 
shortcomings were sufficient to undermine the content validity of the 1994 assessment 
in a substantial way. More specifically, the 1994 fourth-grade assessment contained 
relatively few items that were within the scope of the least able students, making it 
difficult to get precise and reliable estimates of achievement for those at the lower end 
of the NAEP scale. Some unevenness of item quality was also observed. Specifically, 
some of the scoring guides for constructed-response items were inconsistent with 
other features of the items or with the directions given to students, and a number of 
the more difficult items failed to capture the essential features of advanced reading 
achievement. The Panel judges each of the above to be areas in which the NAEP 
contractor should begin improvements immediately, in preparation for the next NAEP 
reading assessment. 

Finally, the Panel points out, as it has in its previous reports, that under the current 
funding and development process there is little time for planned, farsighted content 
development. In particular, during years when new frameworks are adopted, the 
NAEP contractor has typically had less than six months in which to develop field test 
materials before these materials must be finalized. Nevertheless, assessment tasks that 
are produced during this one brief period set the tone for all future assessments until 
the next revision of the framework, eight or ten years in the future. 

The Panel therefore recommends that, for every NAEP subject area, NAGB and NCES 
adopt a process that allows new research and development to begin several years 
before the framework is scheduled to be revised and a new trend line begun. This 
research and development could progress in a relatively modest manner through 
successive pilot studies and small-scale trials targeted at particularly challenging 
research problems. When it is time to begin the actual revision, Congress, NAGB and 
NCES should allow for a framework and item development cycle that is substantially 
longer and more integrated than the current one. The Panel further recommends that 
Congress forward fund NAEP in order to facilitate this process. 



Sampling and Assessment Administration 



As it had in 1990 and 1992, the Panel once again concluded that both sampling and 
administration for the 1994 TSA were done well, were generally consistent with best 
practice for major surveys of this kind, and, with the exceptions noted below, produced 
valid and useful state results. Two areas of concern were identified, however. 

First, and most important, substantial problems were found with the samples of 
nonpublic schools that were — at the Panel’s previous recommendation — added to the 
TSA for the first time in 1994. The Panel’s motivation to include nonpublic schools 
was based on its inclusiveness principle, and the Panel’s intention was to aggregate 
the nonpublic school results with those from public schools in order to generate better 
overall state composites. However, NCES adopted more extensive reporting plans 
after determining that it would be difficult to recruit nonpublic schools without 
offering them separate reports of student achievement by type of school control. 

Upon reviewing the evidence, the Panel concluded that the 1994 state samples of 
nonpublic schools were not large enough to support separate reporting. In addition, 
participation rates for originally-sampled nonpublic schools were unacceptably low in 
approximately 40 percent of the states, and final samples were biased, in many cases, 
by the fact that certain kinds of nonpublic schools were much less likely to participate 
than others. 

The Panel recommends that NAGB and NCES stop separate reporting of state-level 
nonpublic school results but, where participation rates are sufficiently high, continue 
reporting state-level results for public and nonpublic schools combined and for public 
schools only. Furthermore, the reports should include prominent warnings about the 
invalidity of simplistic comparisons between public and nonpublic schools in order to 
discourage efforts to derive such comparisons by subtracting public school means 
from the combined public and nonpublic school results. These warnings should be 
illustrated by concrete examples to underscore their significance. 
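
One such concrete example can be sketched in a few lines of Python. The sketch assumes that the combined mean is a simple weighted average of the public and nonpublic means. The function name, the 10 percent nonpublic share, and the scale scores are hypothetical values chosen for illustration and are not NAEP results.

```python
def implied_nonpublic_mean(combined_mean, public_mean, nonpublic_share):
    """Back out a nonpublic mean from a combined mean.

    Assumes combined = (1 - share) * public + share * nonpublic,
    which is a simplification of any real weighting scheme.
    """
    if not 0.0 < nonpublic_share < 1.0:
        raise ValueError("nonpublic_share must be strictly between 0 and 1")
    public_share = 1.0 - nonpublic_share
    return (combined_mean - public_share * public_mean) / nonpublic_share


# Hypothetical state: 10 percent of students attend nonpublic schools.
share = 0.10
public = 212.0
combined = 213.0

baseline = implied_nonpublic_mean(combined, public, share)

# Shift the combined mean by one scale point, an amount that could
# plausibly arise from sampling error alone.
shifted = implied_nonpublic_mean(combined + 1.0, public, share)

print(round(baseline, 1), round(shifted, 1), round(shifted - baseline, 1))
```

Under these assumptions, a one-point error in the combined mean moves the derived nonpublic mean by 1 / share points, or ten points here. The smaller the nonpublic share, the less trustworthy any figure derived this way becomes, even before differences between the two student populations are considered.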

At the same time, NAGB and NCES should explore alternative strategies (other than 
separate state-level reporting) for motivating the participation of nonpublic schools. 
One proposed course of action would be to offer more detailed reporting of private 
school results at the national level by basing the analyses on aggregated data from the 
national and state samples of nonpublic schools. (For example, NAEP could break out 
results by more detailed categories of private schools.) 

A second area of concern to the Panel involves the participation of public schools. 
Although the Panel found that in 1994, for nearly all states, the participation rates for 
originally-sampled public schools ranged from acceptable to good, strong indications 
have emerged that the data collection burden on the states, especially small states, 
may begin to threaten school and hence state participation rates in years when 
multiple subjects and grades are assessed. 

The Panel recommends that NCES and NAGB consider design changes that could 
decrease sample size requirements or otherwise reduce burden without compromising 
the overall quality of the assessment. Applicable design changes could include 
relatively circumscribed modifications, such as applying the principles of finite 
sampling to create a different set of rules for the smallest states. Reduced respondent 
burden could also be effected as one outcome of a more radical redesign of NAEP, 
and various versions of the latter are currently being debated by NAGB and other 
interested parties. 
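
As a rough illustration of how finite sampling principles might lower requirements for small states, the sketch below applies the textbook finite population correction for simple random sampling. This is an assumption about what such a rule could look like, not a description of the actual NAEP design, which uses a complex multistage sample. The function name and the school counts are hypothetical.

```python
import math


def fpc_adjusted_sample_size(n_infinite, population_size):
    """Sample size giving the same precision as n_infinite would give
    in a very large population, under simple random sampling.

    Uses the standard correction n = n0 / (1 + (n0 - 1) / N).
    """
    if n_infinite <= 0 or population_size <= 0:
        raise ValueError("both arguments must be positive")
    n = n_infinite / (1.0 + (n_infinite - 1.0) / population_size)
    # Never ask for more units than the population contains.
    return min(population_size, math.ceil(n))


# Hypothetical target of 100 schools, applied to a large and a small state.
large_state = fpc_adjusted_sample_size(100, 2000)
small_state = fpc_adjusted_sample_size(100, 150)

print(large_state, small_state)
```

With these hypothetical counts, the large state's requirement barely changes, while the small state's requirement falls by more than a third. This is the kind of burden reduction the recommendation has in view.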



The Assessment of Students with Disabilities or 
Limited English Proficiency 



The 1994 assessment cycle occurred at a time when NCES and NAGB were beginning 
to re-examine NAEP policies regarding the exclusion and assessment of students with 
disabilities or limited English proficiency (IEP and LEP students). In 1994, NCES 
gathered data from several sources and met with representatives of the disability and 
bilingual communities to discuss the best methods for increasing inclusion in NAEP. 
The Panel, for its part, collected new data for samples of fourth-grade IEP and LEP 
students who had been selected for participation in the TSA, then shared its 
preliminary findings with NCES. The results of these various efforts led to a set of 
revised exclusion procedures and new allowances for accommodated assessment that 
were tried out in the 1995 field test and implemented, in a controlled design, in 1996. 

The Panel’s 1994 study indicated that school personnel in different states tended to 
interpret the (old) exclusion guidelines differently. Thus, on average, IEP students with 
the same level of ability would be included in some states and excluded in others. 
The Panel also found that a high proportion of IEP students (perhaps as many as 85 
percent) could read well enough to participate in NAEP and be included in estimates 
of overall state achievement. However, the current NAEP reading test is not 
particularly well suited to the reading abilities of the many IEP students who are 
reading a grade or more below grade level. A more appropriate measure for these 
students would address the same reading outcomes but be based on less difficult 
reading passages. 

The Panel’s study of LEP students indicated that a significant proportion of LEP 
students also read well enough in English to participate for the purpose of 
contributing to overall state NAEP results. Disturbingly, among the LEP students 
sampled for the Panel’s study (which was limited to LEP students who had attended 
English-speaking schools for at least two years), more than half had been excluded 
from the TSA. This was true even though more than three-quarters of the Panel’s 
sample had been in English-speaking schools for more than four years — essentially 
their entire school careers. The Panel acknowledges that NAEP may not offer these 
students an optimal opportunity to demonstrate their competence, particularly in 
content areas such as mathematics or science. In reading, however, the Panel believes 
that it is reasonable to ask how well these students are able to read in English. 
Moreover, the education fortunes of LEP students may too easily drop from sight if 
they are excluded from major assessment efforts. 

Finally, the Panel’s studies found that teachers of both IEP and LEP students were 
likely to propose testing accommodations for high percentages of their students. Thus, 
when accommodations are offered, inclusion may be increased, but the overall 
numbers of students assessed under standard conditions may actually go down. This is 
problematic because scores obtained under nonstandard conditions are much more 
difficult to interpret. 

Based on its research findings, the Panel makes the following recommendations. 

1. NCES and NAGB should continue efforts to encourage greater participation of 
students with disabilities or limited English proficiency in the current NAEP 
assessments. At the same time, they should continue research to identify adaptations 
or accommodations for each of these groups that would provide more valid measures 
of subject-area achievement as specified in the NAEP content frameworks. 

2. Results for students with disabilities or limited English proficiency assessed under 
standard conditions should be aggregated with results for all other students in 
producing the overall and subgroup achievement estimates normally reported for the 
nation and the states. The results for these populations should not be disaggregated or 
reported separately. 

3. NAEP should also work to develop assessments that can measure accurately over a 
broader range of student proficiency levels and thereby provide better estimates at 
both ends of the achievement distribution. For efficiency, such an assessment would 
almost certainly require some adaptive mechanism (computerized or otherwise) for 
matching students with assessment tasks appropriate to their levels of proficiency. 
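
One simple form such an adaptive mechanism could take is a two-stage design, in which performance on a short routing block determines the difficulty of the block that follows. The sketch below is only an illustration of that idea. The function name, the block labels, and the cut points are hypothetical and are not drawn from any NAEP procedure.

```python
def route_second_stage(routing_correct, routing_total,
                       low_cut=0.4, high_cut=0.8):
    """Choose a second-stage block from the share of routing items
    answered correctly. Cut points are illustrative assumptions."""
    if routing_total <= 0 or not 0 <= routing_correct <= routing_total:
        raise ValueError("invalid routing block counts")
    proportion = routing_correct / routing_total
    if proportion < low_cut:
        return "easier"
    if proportion >= high_cut:
        return "harder"
    return "standard"


# Three hypothetical students on a ten-item routing block.
for correct in (2, 6, 9):
    print(correct, route_second_stage(correct, 10))
```

Because every student takes the same routing block, results from the different second-stage blocks would still have to be placed on a common scale. The sketch does not address that step.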



Scaling and Analysis 



The procedures used for scaling and analysis in the TSA are generally the same as 
those used in the national NAEP, and analyses for the two assessments are largely 
interconnected. In 1994, two technical errors affecting state scores were discovered, 
and an unexplained but statistically significant drop in performance was observed in 
the national reading results at grade 12. These occurrences led the Panel to give 
greater attention to scaling and analysis in its evaluation of the 1994 TSA than it had in 
its previous evaluations. 

In general, the Panel concluded that NCES and its contractors continue to make use of 
sophisticated methods to solve challenging measurement problems posed by recent 
innovations in testing and to produce generally high quality data. At the same time, 
however, the system appears to be showing strains that allow errors to creep in, in 
addition to lengthening the time to reporting. Factors contributing to these strains have 
included pressure for frequent enhancements to the assessments, increased analysis 
volume, and policy pressure to reduce time to reporting. 

Recent efforts to report short-term trends based on the main NAEP assessments have 
also uncovered some potential problems, related to the fact that many small 
modifications in items and procedures have been permitted between assessments. In 
particular, the accuracy of the 1994-to-1992 12th-grade equating may have been 
affected because the proportion of multiple-choice items was substantially higher in 
the common item pool that served as the basis for the equating than it was in the 
overall assessment. If the two item types indeed measure somewhat different skills, 
then the link was not a good proxy for the whole. 
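
The concern can be illustrated with a toy calculation. The sketch below treats a score change as a weighted average of the changes on multiple-choice and constructed-response items, which is a deliberate oversimplification of the scaling models actually used. All names and numbers are hypothetical and do not describe the 1992 or 1994 results.

```python
def composite_change(mc_share, mc_change, cr_change):
    """Weighted average of score changes on multiple-choice (MC) and
    constructed-response (CR) items, given the MC share of the pool."""
    if not 0.0 <= mc_share <= 1.0:
        raise ValueError("mc_share must be between 0 and 1")
    return mc_share * mc_change + (1.0 - mc_share) * cr_change


# Hypothetical scenario: no change on MC items, a 4-point drop on CR items.
mc_change, cr_change = 0.0, -4.0

# Hypothetical pools: the full assessment is 40 percent MC,
# while the common (linking) pool is 70 percent MC.
whole_assessment = composite_change(0.40, mc_change, cr_change)
common_items = composite_change(0.70, mc_change, cr_change)

print(round(whole_assessment, 2), round(common_items, 2))
```

In this toy scenario the common items register only half of the change seen on the whole assessment. The gap arises solely because the two pools mix the item types in different proportions, and it disappears if the two item types change by the same amount.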

While no one major problem with the 1994 scaling or analysis was observed, the 
accumulation of smaller problems suggests that a modified assessment design would 
better fit the size and objectives of the current NAEP program. The Panel therefore 
supports NAGB’s efforts to develop a new, more streamlined design for NAEP. 

In the meantime, the Panel makes the following recommendations to help ensure the 
integrity of NAEP results: 

1. Any significant change in performance on the short-term trends should routinely be 
checked for reasonableness against other sources of trend data, such as the 
long-term NAEP trend data and state assessment trend data, before the results of the 
short-term trend are reported. 

2. NCES should conduct or commission additional studies to validate the current 
analysis and scaling models. These studies should include research on the strength of 
the models being employed and the robustness to violations of assumptions. 

3. Additional procedures designed to verify the integrity of the NAEP data prior to its
release should be investigated, and NCES should continue to give priority to the timely 
release of high quality technical reports that provide thorough documentation of all 
design-related, technical, and psychometric activities associated with the assessments.



Achievement Levels 


The achievement levels established by NAGB for the 1992 reading assessment were 
again used for reporting the 1994 assessment. The Panel realizes that reporting by 
performance standards is greatly valued by much of the NAEP (and TSA) constituency. 
Nevertheless, the Panel continues to question the reliability and validity of the current 
achievement levels. At the time of its evaluation of the 1992 achievement levels, the 
Panel concluded that 1) the standard-setting method had led to serious internal 
inconsistencies that could have especially troubling consequences if the mix of item 
types changed over time and 2) the distributions of student performance established 
by the achievement-level cutscores were not reasonable based on comparison to the
distributions suggested by various non-NAEP measures. In particular, the weight of 
the evidence suggested that the 1992 achievement levels were set too high. 

Although the achievement-levels contractor fielded a study in 1994 that putatively 
addressed the second of these concerns, the Panel concluded that the design of the 
study did not permit confirmation of specific cutscores. The study was therefore not 
particularly informative with respect to the Panel’s conclusion that the cutscores had 
probably been set too high. 

The Panel also examined the results for the 1994 U.S. history and world geography 
achievement levels in order to determine whether they would exhibit better internal 
consistency or a better match to external criteria than had the 1992 reading and 
mathematics achievement levels. 2 In fact, the Panel once again found troubling 
differences in achievement-level cutscores set using dichotomous versus partial-credit 
(extended-response) items. Although not as dramatic as the differences found for the 
1992 achievement levels, the 1994 results again showed that levels set using
extended-response items were considerably higher than those set using multiple-choice or
dichotomously-scored constructed-response items. 

The Panel also examined the achievement levels in relation to performance on the AP 
examination in U.S. history. 3 Many colleges and universities give college credit for AP
courses taken in high school if students score three or better, and the Panel found that 
2.8 percent of the country’s high school seniors met this criterion on the AP U.S. 
history examination in 1994. By contrast, NAEP classified only 1 percent of high 
school seniors at the advanced level in this subject. Moreover, the percentage passing 



2 U.S. history and world geography were assessed nationally in 1994 but were not included in the TSA.
It was therefore not in the Panel’s purview to conduct a formal evaluation of the achievement levels set
for these subjects. However, to the extent that the data were readily available, the Panel believed it 
should determine whether or not the results from these new level-setting efforts confirmed the Panel’s 
earlier findings. 

3 Only U.S. history could be considered because no AP examination is offered in world geography.



the AP criterion would have been even higher if AP programs had been available in
all U.S. high schools instead of only half of them. These findings provide additional 
evidence that the Governing Board's achievement levels are set too high, that is, that 
the achievement levels identify fewer 12th graders as advanced than actually are 
performing at an advanced level. 

Based on its accumulated evidence concerning the achievement levels and the process 
by which they were set, the Panel makes the following recommendations. 

1. NAGB should institute a competition for the design of new methods for setting 
performance standards for all NAEP subjects with the goal of having a new method in 
place by the time of the year 2000 NAEP assessment. 

2. In the interim, current achievement levels should be accompanied by a warning 
stating that results should be interpreted as suggestive rather than definitive because 
they are based on a methodology that earlier evaluation panels have questioned in 
terms of accuracy and validity. 



Reporting and Dissemination 

In considering the quality of NAEP reporting since the inception of the TSA, the Panel 
has identified four criteria fundamental to successful reporting: 

♦ The accuracy of the results; 

♦ The likelihood that the results will be interpreted correctly
by the intended audience; 

♦ The extent to which the results are accessible and adequately 
disseminated; and 

♦ The timeliness with which the results are made available. 



With respect to the first three criteria, NCES, NAGB, and the NAEP contractors have 
made steady progress. For example, innovative graphic formats intended to convey 
the statistical significance or insignificance of differences between states and across 
time have been tried after each TSA, and the map graphics introduced in 1992 proved 
more successful than earlier efforts. The 1994 reports retained most of these graphics 
and also addressed other concerns about the interpretability and accessibility of the 
results by introducing more charts, visually simplifying the data tables, using more 
white space, and generally shortening the reports. NCES has also begun a series of 
focused reports that highlight specific findings from each assessment cycle, and these 
also have been well received. Two such reports have been scheduled for the 1994 
reading assessment. Further, in response to the expressed need of the state assessment 
directors for a brief and readable summary of results that they could distribute to 
educators and policy makers in their states, NCES produced a four-page brochure that 
was released with the main reading reports in March 1996. 

The 1994 TSA was particularly problematic, however, with respect to timeliness.
Despite efforts by NCES, NAGB, and the NAEP contractors to speed up reporting, the 
new First Look report, which contained only summary findings for the 1994 reading 
assessment, was not released until April, 1995 (13 months after the administration). The 
main reading reports did not appear for nearly another year after that — the longest lag 
between assessment and reporting that has occurred to date. Factors which contributed 
to the delay included unexpected data problems, shifting program priorities, and 
competition for the services of qualified analysis staff. The Panel strongly encourages 
NCES and NAGB to continue to press for quicker reporting while also
being careful to maintain the quality and integrity of the data. 



Utility of the TSA



The final perspective that bears on the overall evaluation of the TSA, and in effect 
subsumes all other perspectives, concerns its utility. As suggested above, utility must 
rest, firstly, on the validity and reliability of the data. Beyond this, the results must be 
timely, accessible, and policy relevant, and the program must be perceived as useful 
and valuable by the major customers of the information it provides — particularly the 
states. To investigate the latter, the Panel commissioned surveys and case studies of 
NAEP’s perceived influence after the release of each round of TSA data, concluding 
with a set of case studies and a mail survey of state assessment directors, mathematics 
specialists, and reading specialists in December, 1995. Throughout its evaluation, the 
Panel also monitored media coverage of NAEP and the TSAs and followed the 
opinions and actions of other NAEP stakeholders. 



Utility of NAEP Data to the States 

For the most part, the Panel concluded from these efforts that state NAEP has become 
a valued indicator of educational progress and has served particularly to provide an 
independent validity check on the states’ own assessments. In Rhode Island, for 
example, the state reading specialist reported that the 1994 TSA reading results 
provided important evidence for the success of an ongoing reading initiative. The 
external monitor role has been especially important during a period when many state 
assessments have undergone radical reform, making upward or downward trends in 
their results particularly difficult to interpret. 

Several factors contribute to state NAEP’s credibility and hence its value to the states 
as an external monitor. These include the assessment’s forward-looking content and 
format, the secure status of its testing materials, and the rigorous statistical standards 
maintained in data collection, analysis, and reporting. (The long lag time to reporting, 
and the lack of a stable assessment schedule against which states can plan, however, 
are two factors cited repeatedly by the states as diminishing state NAEP’s utility.) 

When state NAEP has yielded dramatic or unexpected results, particularly
when a state’s students performed worse than expected, considerable public debate 
has followed. North Carolina and California both provide notable, and very different, 
examples of this effect. 

In 1990, North Carolina educators were dismayed to discover that the state’s students
had done much worse on the NAEP mathematics assessment than the educators had 
expected, based on results from the state’s own assessment, a commercially available, 
norm-referenced test. During the subsequent debate and discussion, decision makers
concluded that North Carolina teachers generally lacked certain key understandings 
that were required to implement their recently introduced, forward-looking 
mathematics curriculum successfully. Information from the NAEP background 
questions on instructional practices helped North Carolina reach this conclusion, and 
the state subsequently undertook an intensive in-service training program that was 
based in part on materials and data from the 1990 NAEP. These remediation efforts 
appeared to be successful in that the 1992 TSA showed a significant gain in eighth- 
grade mathematics achievement for the state. 

In California, educators and the public were also shocked when fourth-grade reading 
achievement estimates from the 1994 TSA showed California performing significantly 
worse than it had done in 1992, and positioned virtually at the bottom of the 
distribution of participating states. This information was particularly important in view 
of the fact that California’s own assessment system has been in disarray for the past 
several years, precluding any meaningful assessment of performance trends from that 
particular source. However, in the resultant furor, most commentators simply pointed 
to the TSA results as further evidence for what they already felt was wrong with the 
state’s education system — whether that was crowded classrooms or the state’s whole 
language reading curriculum. 

Besides using NAEP as an external monitor of achievement, about 60 percent of the 
states that undertook revisions to their mathematics or reading curricula during the 
past five years reported NAEP as a notable source of ideas. Similar numbers referred 
to NAEP as a model, or a source of external validation, for changes to their reading or 
mathematics assessments. State educators, for example, have closely followed NAEP’s 
pioneering efforts to set performance standards, and both assessment directors and 
curriculum support staff have used NAEP’s external credibility to argue for such 
desired objectives as better alignment with National Council of Teachers of 
Mathematics (NCTM) standards or with reading standards based on reading for 
meaning, higher order skills, and real-world reading tasks. 



Contributions to the National Debate 


Interestingly, state NAEP has broadened NAEP’s influence not only at the state level, 
as might be expected, but also at the national level. NAEP has been adopted by the 
National Goals Panel as the primary indicator of progress towards goal three, which 
states that “By the year 2000, American students will leave grades four, eight, and 
twelve having demonstrated competency in challenging subject matter...” 4 

NAEP also routinely receives national press coverage after each of its major data 
releases. The coverage has tended to be more widespread when regional media are 
able to report on results for their own states as well as for the nation. Publications 



4 National Education Goals Panel, The National Education Goals Report: Building a Nation of Learners 
(Washington, D.C.: Author, 1991), 10.
devoted to education news, such as Education Week, also contain frequent references
to NAEP, both as a unique source of information about education achievement and as 
a model for current assessment practices. 



The Impact of State on National NAEP 

When state NAEP was authorized by Congress on a trial basis in 1988, one of Congress’ 
central concerns was whether state NAEP would have a deleterious effect on national 
NAEP. By asking this question, Congress was tacitly affirming the importance of 
protecting the integrity of national NAEP and expressing a concern that state NAEP 
might have a negative impact on state participation in national NAEP, especially in the 
case of small states. There is little evidence that this has happened to date. 

Rather, the Panel believes that an implicit, mostly unspoken quid pro quo has 
developed between the states and NAGB, by means of which the states are willing to 
participate in national NAEP at least in part because of the value they get from 
participation in state NAEP. Since 1990, the Panel has observed movement from 
guarded cooperation among participating states to general anticipation when state 
NAEP results are about to be released. Positive attitudes toward state NAEP can only 
grow if NCES and NAGB are successful in addressing the relatively few persistent 
concerns, such as the uncertainty of the assessment schedule, that states have cited 
repeatedly. As a result, the Panel suggests that, in the unlikely event that Congress 
were to recommend the abandonment of state NAEP, the motivation of the states to 
continue in national NAEP could drop precipitously. 

Thus, in contrast to its original conclusion at the end of the evaluation of the
1990 TSA, which was simply that state NAEP had had no deleterious effect on national 
NAEP, the Panel now believes that the future of national NAEP has become 
intertwined with the future of state NAEP. State NAEP has greatly increased the 
visibility and perceived utility of the entire NAEP program, and suggestions for 
merging the state and national samples continue to arise (although it is not evident 
that such a merger would be feasible or significantly reduce burden). 

There is obviously also an interaction between monies spent on state NAEP and 
monies available to maintain a quality national NAEP program, but the nature of this 
interaction is complex. On the one hand, the substantial funds spent for state NAEP 
cannot then be spent for other NAEP activities. On the other hand, the heightened 
visibility conferred by state NAEP may result in a net increase in national NAEP 
resources. For example, the substantial framework and item development efforts that 
have characterized the last several years have benefited both programs and almost 
certainly would not have been funded without the impetus of state NAEP. 



The Panel’s Recommendation for the 
Continuation of State NAEP 



Based on its evaluation of the TSAs, the Panel concludes that state NAEP has been
shown to be a valid, reliable, and useful measure of student achievement, and that it 
aligns favorably with the Panel’s quality, utility, and state indicator principles. For 
these reasons, the Panel recommends that state NAEP be continued, and that it be
moved from developmental to permanent status when NAEP is next reauthorized. 
However, in light of its size and cost, the Panel further recommends that the scope 
and function of state NAEP be reviewed regularly, and particularly after any 
substantial change in mission or design. Such re-evaluation should be done in the 
context of the overall NAEP program and with the abiding aim of providing the best 
and most useful information about student achievement for the nation.

There are areas, however, in which it is not yet possible to determine the course that 
will best serve NAEP’s mission and goals. Some of these areas, which should continue 
to be examined in the near future, include:

♦ The viability of continuing to assess nonpublic schools in the state 
NAEP program; 

♦ The value and feasibility of grade-12 state assessments; 

♦ The tension between including as many students with disabilities and 
limited English reading skills in the assessment as possible, and the 
cost of doing so; 

♦ The adequacy of the present NAEP design to meet the increasing 
demands of NAEP’s stakeholders while still satisfying the Panel’s 
quality principle; and 

♦ The development of improved performance standards for reporting 
NAEP results. 



In the fall of 1996, the Panel will present its capstone report. Building upon the Panel’s
previous work, the report will look forward to the year 2000 and beyond, considering 
recommendations for the design of a NAEP program that offers quality assessments for 
the nation and the states and also anticipates the changing nature of education 
practice as the latter will be influenced by technology and by our developing 
knowledge of learning and human cognition. 



1 Introduction






The Context for the Panel’s Evaluation of the 1994 TSA 



Since its inception in 1969, the National Assessment of Educational Progress (NAEP) 
has been the nation’s leading indicator of what American students know and can do. 
The high technical quality of the assessment and its independence from education and 
political pressures have enabled NAEP to reliably monitor changes in education 
achievement and practices for nearly three decades. Moreover, as the only education 
assessment administered to a representative sample of American students, NAEP has 
been able to track changes not only for the population as a whole, but for important 
subgroups as well. For example, analyses of NAEP trend data in the past decades 
revealed a significant narrowing of the achievement gap between African American 
and white students during the late 1970s and 1980s, while subsequent changes in 
these same data patterns alerted observers to an apparent reopening of that gap in the 
1990s. Similarly, NAEP has pointed to important trends in specific subject areas, 
documenting, for example, the declining achievement of American students in science 
between 1970 and 1990, as well as the limited time spent on science instruction in 
most elementary classrooms. These data have provided evidential support for groups 
such as the American Association for the Advancement of Science and the National 
Science Foundation, who have argued for greater attention to science education in 
American schools. Finally, NAEP data were also cited in discussions and debates that 
led the Governors and then Congress to target improved science achievement as an 
important national education goal. 

The National Assessment of Educational Progress has thus been a long-standing and
useful source of information for national education reformers and policy makers. It 
was not until 1988, however, that NAEP could begin to play a similar role for the 
individual states. Prior to the passage of Public Law 100-297 in that year, NAEP was 
prevented both by its design and by its mandate from collecting and reporting data at 
the state level. That prohibition was lifted by Congress in 1988 with its authorization 
of an experimental new component to the NAEP program — the Trial State Assessments 
(TSA). This action reflected both the educational and the political context of the 
times. Specifically, by approving and funding the state-by-state assessment, Congress 
was responding to increased education reform activity in the states and to expanded 
interest in using state-level data to monitor progress. At the same time, by making 
participation in the TSA voluntary, Congress demonstrated its continuing respect for 
the constitutional authority granted to the states for the education of their residents. 
Finally, by authorizing state NAEP on a “trial” basis only, Congress showed recognition 
that such a massive expansion of the NAEP program was unproven in its direct 
effectiveness and in its overall impact. Congress further underscored this recognition 
by mandating that the trials be independently evaluated so as to determine whether 
they provided valid, reliable, and useful data for the states. 

The 1994 state NAEP assessment in reading brings to a close the series of TSAs 
authorized under the original 1988 legislation and subsequently extended through 
1994. With the conclusion of the TSAs comes also the conclusion of the evaluation of 
the program under the auspices of The National Academy of Education (NAE) Panel 
on the Evaluation of the NAEP Trial State Assessment. The purpose of this report is
twofold: first, to present the Panel’s specific findings and recommendations stemming 
from its evaluation of the 1994 TSA in fourth-grade reading and second, to offer 
several general conclusions regarding the larger TSA program. In a subsequent 
capstone report, the Panel will build on these conclusions to provide analysis and 
recommendations for the entire NAEP program in the year 2000 and beyond. 



History of NAEP TSA Evaluations 



Over the past six years, the NAE Panel has seen the NAEP state assessment program 
grow, albeit somewhat erratically, and become a highly valued feature of the NAEP 
program. In the face of budgetary constraints, however, the final 1994 trial, like the
first, included only one subject at one grade. The largest trial, in 1992, covered two 
subjects at grade 4 and one at grade 8. A trial at the 12th grade was never carried out. 
(See table 1.1.) 

Table 1.1. Grades and subjects assessed in the NAEP trial state
assessments 1990-1994

            1990      1992             1994

Grade 4               Reading, Math    Reading

Grade 8     Math      Math



Participation by states and other jurisdictions has increased with each assessment 
cycle, reaching a new high of 46 participating jurisdictions with the just-completed
1996 state assessment, which was also the first state assessment that was not labeled 
as a “trial.” Additionally, state trend results, which have been available from every 
assessment since 1992, have helped to sustain the attention of user groups, who 
continue to show strong interest in the results. 

Throughout this period, the NAE Panel has been involved in an ongoing evaluation of 
the TSA program. Three reports were submitted to Congress over the course of the 
first two TSAs. In them, the Panel concluded that the TSAs were successful and 
should be continued. Certain areas for improvement or further study were also noted,
however, including areas in which the full consequences could not be ascertained 
before the trials were scheduled to end. Furthermore, the Panel noted in the last of 
these reports, The Trial State Assessment: Prospects and Realities, that “many of the 
factors affecting the quality and feasibility of state NAEP are the same as those 
affecting national NAEP.” 1 Accordingly, the Panel called for continuing evaluation of 
the full NAEP program, including the state assessment component. It also called for 
continuing research and development in the important area of performance standards. 



1 The National Academy of Education, The Trial State Assessment: Prospects and Realities (Stanford, CA: 
Author, 1993), 104. 